Integral Reinforcement Learning for Finding Online the Feedback Nash Equilibrium of Nonzero-Sum Differential Games
Authors
Abstract
Adaptive/Approximate Dynamic Programming (ADP) is the class of methods that provide an online solution to optimal control problems while making use of measured information from the system and performing computation in a forward-in-time fashion, as opposed to the backward-in-time procedure that characterizes the classical Dynamic Programming approach (Bellman, 2003). These methods were initially developed for systems with finite state and action spaces and are based on Sutton's temporal difference learning (Sutton, 1988), Werbos' Heuristic Dynamic Programming (HDP) (Werbos, 1992), and Watkins' Q-learning (Watkins, 1989). The applicability of these online learning methods to real-world problems is enabled by approximation tools and theory. The value associated with a given admissible control policy is determined using value function approximation, online learning techniques, and data measured from the system. A control policy is then determined based on the information about control performance encapsulated in the value function approximator. Given the universal approximation property of neural networks (Hornik et al., 1990), they are generally used in the reinforcement learning literature for the representation of value functions (Werbos, 1992; Bertsekas and Tsitsiklis, 1996; Prokhorov and Wunsch, 1997; Hanselmann et al., 2007). Another type of approximation structure is a linear combination of a basis set of functions, which has been used in (Beard et al., 1997; Abu-Khalaf et al., 2006; Vrabie et al., 2009). The approximation structure used for performance estimation, endowed with learning capabilities, is often referred to as a critic. Critic structures provide performance information to the control structure that computes the input of the system. The performance information from the critic is used in learning procedures to determine improved action policies.
The methods that make use of critic structures to determine optimal behavior strategies online are also referred to as adaptive critics (Prokhorov and Wunsch, 1997; Al-Tamimi et al., 2007; Kulkarni and Venayagamoorthy, 2010). Most previous research on continuous-time reinforcement learning algorithms that provide an online approach to the solution of optimal control problems has assumed that the dynamical system is affected by only a single control strategy. In a game-theoretic setup, the controlled system is affected by a number of control inputs computed by different controllers.
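To make the critic idea concrete, the following is a minimal sketch of a linear-basis critic trained by temporal-difference updates from measured data, in the spirit of the value function approximation described above. The quadratic basis, the toy linear system, and all gains are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical quadratic basis for a 2-state system: phi(x) = [x1^2, x1*x2, x2^2]
def phi(x):
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

def td_critic_update(w, x, x_next, cost, gamma=0.99, alpha=0.01):
    """One temporal-difference step for a linear critic V(x) = w^T phi(x)."""
    td_error = cost + gamma * (w @ phi(x_next)) - w @ phi(x)
    return w + alpha * td_error * phi(x)

# Toy usage: learn from trajectories of a stable linear system x_next = A x
A = np.array([[0.9, 0.1], [0.0, 0.8]])
w = np.zeros(3)
x = np.array([1.0, -1.0])
for _ in range(200):
    x_next = A @ x
    cost = x @ x  # quadratic stage cost, measured along the trajectory
    w = td_critic_update(w, x, x_next, cost)
    # restart the trajectory once the state has decayed, to keep learning data rich
    x = x_next if np.linalg.norm(x_next) > 1e-3 else np.array([1.0, -1.0])
```

The critic only consumes measured quantities (state samples and stage costs), which is what allows such schemes to run forward in time without a full system model.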
Similar Papers
Data-Based Reinforcement Learning Algorithm with Experience Replay for Solving Constrained Nonzero-Sum Differential Games
In this paper, a partially model-free reinforcement learning (RL) algorithm based on experience replay is developed for finding online the Nash equilibrium solution of multi-player nonzero-sum (NZS) differential games. In order to avoid performance degradation or even system instability, amplitude limitations on the control inputs are considered in the design procedure. The proposed al...
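The experience-replay mechanism the blurb refers to can be sketched as a simple transition buffer that is sampled for repeated learning updates. This is a generic illustration only; the capacity, batch size, and stored fields are assumptions, not details of the cited algorithm.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay buffer: stores measured transitions and
    returns random mini-batches so past data can be reused in updates."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, cost, next_state):
        self.buffer.append((state, action, cost, next_state))

    def sample(self, batch_size):
        # sample without replacement, never more than is stored
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# Usage: record a few transitions, then draw a mini-batch
buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(i, 0, 1.0, i + 1)
batch = buf.sample(2)
```

Replaying stored transitions improves data efficiency, which matters in the online, data-based setting the paper targets.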
Vehicle Routing Problem in Competitive Environment: Two-Person Nonzero Sum Game Approach
The vehicle routing problem (VRP) is one of the most important issues in transportation. Among VRP problems, the competitive VRP is particularly important because there is tough competition between distributors and retailers. In this study we introduce a new method for the VRP in a competitive environment, in which Two-Person Nonzero-Sum games are defined to choose an equilibrium solution. Therefore, revenue giv...
Learning in Markov Games with Incomplete Information
The Markov game (also called stochastic game (Filar & Vrieze 1997)) has been adopted as a theoretical framework for multiagent reinforcement learning (Littman 1994). In a Markov game, there are n agents, each facing a Markov decision process (MDP). All agents’ MDPs are correlated through their reward functions and the state transition function. As Markov decision process provides a theoretical ...
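The coupling the blurb describes, with several agents sharing one state transition function but holding separate reward functions, can be sketched as a small two-agent Markov game. The two-state, two-action instance below is a hypothetical example built from random tables, purely to show the structure of the tuple.

```python
import numpy as np

# Minimal two-agent Markov game (hypothetical 2-state, 2-action example):
# transitions are shared, while each agent has its own reward table.
n_states, n_actions = 2, 2
rng = np.random.default_rng(0)

# P[s, a1, a2] is a probability distribution over next states
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
# R[i, s, a1, a2] is the reward to agent i (rewards need not sum to zero)
R = rng.standard_normal((2, n_states, n_actions, n_actions))

def step(s, a1, a2):
    """Sample one transition of the game under the joint action (a1, a2)."""
    s_next = rng.choice(n_states, p=P[s, a1, a2])
    return s_next, R[0, s, a1, a2], R[1, s, a1, a2]

s_next, r1, r2 = step(0, 1, 0)
```

Because both the next state and both rewards depend on the joint action, each agent's decision problem is an MDP only when the other agent's policy is held fixed.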
Stochastic nonzero-sum games: a new connection between singular control and optimal stopping
In this paper we establish a new connection between a class of 2-player nonzero-sum games of optimal stopping and certain 2-player nonzero-sum games of singular control. We show that whenever a Nash equilibrium in the game of stopping is attained by hitting times at two separate boundaries, then such boundaries also trigger a Nash equilibrium in the game of singular control. Moreover, a different...
Friend-or-Foe Q-learning in General-Sum Games
This paper describes an approach to reinforcement learning in multiagent general-sum games in which a learner is told to treat each other agent as either a "friend" or "foe". This Q-learning-style algorithm provides strong convergence guarantees compared to an existing Nash-equilibrium-based learning rule.
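The friend/foe distinction amounts to evaluating the learned joint-action value table with two different operators. The sketch below illustrates the idea on a 2x2 payoff matrix; note that the foe case is restricted here to deterministic opponent responses for brevity, whereas the full algorithm optimizes over mixed strategies (via a linear program).

```python
import numpy as np

def friend_value(Q):
    """Friend case: assume the other agent cooperates to maximize our payoff,
    so the state value is the maximum over joint actions."""
    return Q.max()

def foe_value(Q):
    """Foe case: treat the game as zero-sum and take the maximin value.
    Rows are our actions, columns the opponent's; deterministic policies only
    here, which is a simplification of the mixed-strategy minimax in the paper."""
    return Q.min(axis=1).max()

# Usage on a small illustrative joint-action value matrix
Q = np.array([[3.0, 0.0],
              [1.0, 2.0]])
# friend_value(Q) is the best joint outcome; foe_value(Q) guards against the worst column
```

The gap between the two values on the same Q table shows how strongly the assumed relationship with the other agent shapes the learned policy.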
Publication date: 2011